Phase 6.4: epoll-driven RX dispatch with adaptive timeout (3.5× throughput vs main) by dpsoft · Pull Request #69 · the-void-ia/void-box

dpsoft · 2026-05-03T21:01:46Z

Status: DRAFT. Stacked on top of PR #68 (`smoltcp-passt-port-phase0` — Phases 0–5 + 5.5b). Implements Phase 6.4 of `docs/superpowers/plans/2026-04-30-smoltcp-passt-port.md` per the detailed TDD plan in `docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.4.md`.

What this branch does

Replaces `net_poll_thread`'s timer-driven `sleep(5 ms)` cycle with an event-driven `epoll_wait` dispatch using an adaptive timeout (5 ms when active, 50 ms when idle). Host TCP/UDP/ICMP socket FDs are registered with a shared `EpollDispatch` on every `flow_table` insert; `net_poll_thread` blocks on `epoll_wait` and feeds ready events to `SlirpBackend::drain_to_guest` via a small per-backend event queue. A self-pipe `Waker` lets `process_guest_frame` (vCPU thread) wake the poll thread immediately when it queues frames.

Headline perf vs `origin/main` (`voidbox-network-bench --iterations 3 --bulk-mb 10`)

Metric	Master	This branch	Δ
TCP g2h throughput	1885 Mbps	6680 Mbps	+254% (3.5×)
TCP bulk-g2h (SO_RCVBUF=4096)	1565 Mbps	5650 Mbps	+261% (3.6×)
TCP CRR p50	~10 ms	~10 ms	parity
TCP RR p50	2 µs	1-3 µs	parity (within noise)

Divan microbench wins (`scripts/bench-compare.sh --baseline `47868f0` --skip-vm`)

The big-N case is the design's headline win — only ready flows get visited, no full-table scan:

Bench	Baseline	This branch	Δ%
`poll_with_n_mixed_flows/999`	304 µs	8.5 µs	-97%
`poll_with_n_mixed_flows/99`	29.3 µs	3.81 µs	-87%
`poll_with_n_mixed_flows/3`	1.11 µs	651 ns	-41%
`flow_table_insert_remove/1000`	23.3 µs	24.4 µs	+5%
`tcp_bulk_throughput_1mb`	59.1 ms	59.8 ms	+1% (parity)

Adaptive timeout: how it preserves correctness

The naive `epoll_wait(5 ms fixed)` recovered CRR but burned 200 wakes/sec when idle. The naive `epoll_wait(50 ms fixed)` recovered idle CPU but caused +40 ms CRR latency (Linux delayed-ACK timer).

Adaptive: last cycle had any kernel event ⇒ next timeout = 5 ms (active); last cycle pure timeout ⇒ next timeout = 50 ms (idle). One quiet cycle drops to idle, one event back to active.

The subtle part: when `epoll_waker.wake()` fires during a 50 ms idle wait, the kernel's `epoll_wait` returns with the self-pipe event. `wait_with_timeout` filters that event out — so a naive "is_empty ⇒ idle" predicate would have stayed at 50 ms, regressing CRR. `wait_with_timeout` now returns the raw kernel count (including filtered self-pipe wakes); adaptive treats wakes as activity. Filtered events still arrive in the out parameter unchanged.

Inherent microbench overhead (test/bench fallback path only)

Microbenches without a `net_poll_thread` fall back to `drain_to_guest`'s self-poll path (one mutex op + one non-blocking `epoll_wait` syscall per call, ~310 ns). This does NOT apply to production, where `has_external_poller=true` after the first `push_ready_events` skips the syscall (~50 ns total).

Bench (test path)	Baseline	This branch	Note
`poll_idle`	160 ns	441 ns	inherent fallback cost
`poll_with_n_flows/*`	159-180 ns	440-441 ns	inherent fallback cost
`dns_cache_hit`	190 ns	1.03 µs	inherent fallback cost
`process_arp_request`	4.91 µs	18.1 µs	EpollDispatch::new() per iteration in benches that build fresh stacks (~10 µs setup amortized in production once per VM)

Iteration history

The first cut of Phase 6.4 caused a 40× throughput regression in production (1885 → 44 Mbps). Six fix passes with eBPF profiling (perf-agent) restored it:

`ed048e5` — eliminate epoll mutex contention. `net_poll_thread` held the EpollDispatch mutex for the full 50 ms of its blocking wait; `drain_to_guest` contended on the same mutex. Per-backend `pending_events` queue replaces the contention.
`17b437b` — replace two-pass relay sweep with lazy close queue. The Pass-1 `flow_table.keys().copied().collect()` was 47 cache misses/1K instructions — now FIN/RST handlers push to `pending_close` directly.
`bdef4bd` — drain `pending_events` first, fall back to non-blocking poll only if empty. The original try_lock-vs-pending_events branch dropped events the net-poll thread had already pushed.
`15231cb` — wake `net_poll_thread` when `process_guest_frame` queues frames. Pre-Phase-6.4 the 5 ms sleep masked the absence of an explicit wake; with `epoll_wait` waiting on FD readiness, queued ACKs sat 50 ms before flushing because the guest is the writer (no FD-side signal). This was the throughput unblocker (44 → 6000 Mbps).
`bebeb30` — `epoll_wait` timeout 50 ms → 5 ms. CRR p50 was stuck at 51 ms (one full epoll cycle) because the Linux guest relies on regular IRQ pulses to advance its TCP delayed-ACK timer. 5 ms cadence mirrors pre-Phase-6.4; fast paths still wake immediately via epoll readiness. Restored CRR to 10 ms baseline.
`1d3e816` — added `has_external_poller` flag (skips the fallback syscall when `push_ready_events` is feeding events) and stripped `Phase-N` references from inline comments per AGENTS.md style guide.
`9a46865` — adaptive 5 ms / 50 ms timeout. Recovers Phase 6.4's headline 10× idle-wakeup reduction while preserving CRR parity. Required exposing raw kernel event count from `wait_with_timeout` so adaptive treats self-pipe wakes as activity.

Profile evidence (post-fix, 60 s capture during voidbox-network-bench)

IPC: 0.544 (was 0.413 pre-fix; threshold for memory-bound is <0.8 — improved, still bound).
Cache misses: 38.1 / 1K (was 47.4; threshold 10 — improved, still high — would need data-structure work to push lower).
P99 on-CPU: 0.059 ms (was 50.7 ms pre-fix — the lock-contention tail is gone).
`Mutex::lock_contended` on-CPU: 1.34 % (was the dominant cost pre-fix).
`net_poll_thread` on-CPU at fixed 5 ms cadence: 4.93 % of total CPU. Adaptive brings this to ~10× lower during idle stretches — the headline 10× idle-wakeup reduction is recovered.

Validation

Suite	Status
`cargo fmt --all -- --check`	✅
`cargo clippy --workspace --all-targets --all-features -- -D warnings`	✅
`cargo test --test network_baseline`	✅ 18/18 (default), 19/19 with `--features bench-helpers -- --test-threads=1`
`cargo test --lib network::epoll_dispatch`	✅ 5/5
`cargo bench --bench network --features bench-helpers --no-run`	✅
`cargo build --release`	✅
`voidbox-network-bench --iterations 3 --bulk-mb 10`	✅ 3.5× throughput vs master, RR/CRR parity
eBPF profile (`~/.local/bin/perf-agent`)	✅ post-fix off-CPU `drain_to_guest` near zero (was 9% pre-fix)

Open follow-ups (deferred to future PRs, NOT blocking)

Phase 6.1 (TCP half-close), Phase 6.2 (async connect), Phase 6.3 (window mgmt) — separate plans, all deferred per overview at `smoltcp-passt-port-phase6.md`.
`port_forward_accept_latency` stays at 50 ms (listener thread polls every 50 ms; un-touched by Phase 6.4). Moving the listener accept onto the same epoll set would drop this to microseconds — small follow-up, not in 6.4 scope.
Cache-miss rate 38 / 1K would benefit from a faster HashMap hasher or contiguous `FlowEntry` storage. Worthwhile but invasive; deserves its own focused work.

Test plan for review

`cargo test --test network_baseline` passes 18/18.
`cargo test --test network_baseline --features bench-helpers -- --test-threads=1` passes 19/19 (the snapshot-rebuild smoke pin).
`cargo bench --bench network --features bench-helpers tcp_rx_latency_one_packet` produces a sub-millisecond median.
`scripts/bench-compare.sh --baseline 47868f0 --skip-vm` shows the wins above and the documented overheads.
`voidbox-network-bench --iterations 3 --bulk-mb 10` shows g2h ~6680 Mbps, bulk-g2h ~5650 Mbps, CRR ~10 ms.
Skim `docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.4.md` for the design rationale and the iteration history.

Adds two planning docs under docs/superpowers/plans/: - 2026-04-27-smoltcp-passt-port.md (spec) Supersedes the 2026-04-12 network-backend-abstraction design. Replaces "add passt as opt-in backend" with "lift passt's design patterns into our smoltcp stack" — keeps observability, all-Rust path, single binary, cross-platform parity. Lists required skills for execution (rust-style, rustdoc, rust-analyzer-ssr, superpowers TDD/verification, repo verify/profile). Maps the work into 5+1 phases with per-phase plan-doc placeholders. - 2026-04-27-smoltcp-passt-port-phase0.md (Phase 0 plan) 25 bite-sized TDD tasks: correctness baseline pins, divan microbenches, wall-clock e2e harness, NetworkBackend trait extraction, SlirpStack → SmoltcpBackend rename. Includes three BROKEN_ON_PURPOSE assertions that flip in later phases.

…ddress

Add two baseline tests for the smoltcp DNS proxy: - dns_query_resolves: sends a query for example.com, polls ≤20×100ms, asserts reply XID matches. - dns_cache_keys_by_question_not_xid: warms cache with xid=1, then queries with xid=2 and asserts the stack rewrites the reply XID. Both tests skip gracefully (eprintln + early return) when the upstream resolver is unreachable, making them safe in offline CI. Also adds QNAME_EXAMPLE_COM const and two module-scope helpers: build_dns_query (builds a correct UDP DNS frame with proper payload_len) and parse_dns_reply_xid. SLIRP_DNS_IP added to the existing module-scope slirp import.

Implement measure_tcp_throughput_g2h: binds a host-side TCP listener, boots a VM, execs dd|nc in the guest, drains to EOF on the host, and computes Mbps from bytes_received / elapsed. h2g left None with a TODO.

Implements measure_rr_latency and measure_crr_latency in voidbox-network-bench, reusing the single shared VM booted for throughput measurements. RR: guest pipes N bytes over one persistent nc connection; host times each read+write pair (first sample discarded to absorb connect jitter). CRR: guest runs N independent nc invocations; host times each full accept+read+write+close cycle. Both use the existing percentile() helper (dead_code attribute removed). Latency measurements always run regardless of --no-throughput.

Per user feedback: "Slirp" denotes the user-mode-NAT role; "smoltcp" is the underlying library. Role-based naming keeps the public type surface stable across library swaps and matches the symmetry of future TapBackend / VhostNetBackend siblings. Module file src/network/slirp.rs keeps its name (already aligned with the new type, matches src/devices/virtio_net.rs convention).

The actual polling logic now lives in drain_to_guest, which writes directly into the caller-supplied &mut Vec<Vec<u8>> buffer — no fresh allocation on every tick. poll becomes a #[deprecated] shim: #[deprecated(note = "use drain_to_guest")] pub fn poll(&mut self) -> Vec<Vec<u8>> { let mut out = Vec::new(); self.drain_to_guest(&mut out); out } Existing call sites (virtio_net.rs, tests/network_baseline.rs, benches/network.rs) are annotated with #[allow(deprecated)] and a TODO(0D.4/0D.5) marker. They will be migrated in the next two tasks, after which the allow attributes can be removed.

Switch VirtioNetDevice::slirp from Arc<Mutex<SlirpStack>> to Arc<Mutex<dyn NetworkBackend>>, replacing the deprecated poll() call in get_rx_frames with drain_to_guest into a reused rx_scratch buffer. Update both VMM cold-boot and snapshot-restore construction sites to coerce Arc<Mutex<SlirpStack>> to the trait object. All 14 baseline tests pass; fmt and clippy clean.

Type rename only — the slirp.rs module file keeps its name. SlirpBackend reflects the user-mode-NAT role rather than the underlying smoltcp library, keeping naming symmetric with future TapBackend / VhostNetBackend siblings.

Introduces the types and helper needed for ICMP echo NAT (Phase 1): - IcmpEchoKey {guest_id, dst_ip}: hash key for the echo NAT table. - IcmpEchoEntry {sock, guest_id, last_activity}: per-request state. - open_icmp_socket(): opens SOCK_DGRAM/IPPROTO_ICMP (no CAP_NET_RAW). - icmp_echo: HashMap<IcmpEchoKey, IcmpEchoEntry> field on SlirpBackend, initialized to HashMap::new() in with_security() (the canonical ctor; new() and Default both delegate through it). No behavior change — handle_ipv4_frame is untouched, the map stays empty. Dead-code allowances are scoped to the new items and will be removed once tasks 1.2/1.3 wire them in.

Divan microbench (`tcp_rx_latency_one_packet`) measures the SLIRP-layer per-packet dispatch cost when one TCP flow is Established and the host kernel has data ready: one zero-timeout epoll_wait + readiness scan + peek + Ethernet frame construction. Measured median on this host: ~9.8 µs per drain_to_guest call. Pre-6.4 the relay iterated every flow in flow_table unconditionally regardless of readiness. Post-6.4 it dispatches only the flows with an epoll EPOLLIN event, reducing wasted work on idle flows to zero. This bench is the regression anchor for that change. The bench is gated on `--features bench-helpers` (like the existing `tcp_inbound_syn_ack_transition` and `synthesize_inbound_syn` benches). It performs a full 3-way handshake outside the timed loop so only the hot relay path is measured. Note: this bench cannot exercise the net_poll_thread 50 ms epoll cycle (that thread does not run inside divan). The wall-clock host→guest latency floor is the province of voidbox-network-bench's `tcp_rx_latency_us_p50` field. That field is added to the Report struct in this commit but returns None (deferred): wiring a guest-side listener requires either a guest daemon or an additional exec RPC — both out of scope for Phase 6.4. The divan microbench is the primary numerical deliverable for this phase.

net_poll_thread holds the EpollDispatch mutex for the full 50 ms of its blocking wait. drain_to_guest's own non-blocking wait_with_timeout(ZERO) call contended on the same mutex, serializing the vCPU thread behind the net-poll thread. voidbox-network-bench saw TCP g2h throughput drop from ~1885 Mbps to ~44 Mbps (40× regression). Fix: SlirpBackend gets a small Mutex<Vec<EpollEvent>> queue. net_poll_thread pushes events into it after each successful wait_with_timeout. drain_to_guest drains the queue (brief uncontended lock) without touching EpollDispatch. A try_lock fallback path serves unit tests (no net_poll_thread) without blocking on the mutex. NetworkBackend trait gains a push_ready_events default-no-op so SlirpBackend can override it; VirtioNetDevice exposes push_events_to_backend as the trampoline called by net_poll_thread. Off-CPU profile evidence: drain_to_guest was 9% off-CPU (29.7s in a 60s window) waiting on the epoll mutex; should drop to near-zero post-fix.

relay_tcp_nat_data's Pass 1 unconditionally copied every TCP FlowKey into a Vec to scan for Closed entries on every drain call. Cache misses 47/1K under load; poll_with_n_flows/100 regressed +246% (130ns → 450ns), /1000 regressed +220%. Fix: when handle_tcp_frame's FIN/RST handlers and mid-function error paths set state=Closed, push the key onto a pending_close Vec. relay_tcp_nat_data drains this Vec at the top of its single ready-events pass — no O(n) collect required. Idle-timeout detection retains a direct flow_table iteration but without allocating a separate key Vec.

…_guest The initial Bug A fix used pending_events.lock() + try_lock(epoll) in drain_to_guest's fast path, adding ~150ns overhead per call vs Phase 6.4 (one extra Mutex acquire). This showed as +38% regression in poll_idle bench (441ns → 611ns). Revised approach: try_lock epoll first (zero cost when uncontended — tests, benches, idle net-poll thread). On Err (net_poll_thread holds the mutex for 50ms), drain pending_events instead. In production the try_lock fails ~once per 50ms window; in tests it always succeeds. Net result: drain_to_guest overhead matches Phase 6.4 when epoll is uncontended; contention eliminated when net_poll_thread is actively waiting.

Pre-Phase-6.4, net_poll_thread woke unconditionally every 5 ms, so every ACK queued in inject_to_guest by handle_tcp_frame got flushed within 5 ms. Phase 6.4's epoll_wait(50 ms) waits for FD readiness events — but a guest writing data has no FD-side signal (the guest is the writer; the SLIRP-side socket only becomes readable when the host responds). So queued ACKs sat 50 ms before being flushed; TCP send window stalled; voidbox-network-bench TCP g2h dropped from ~1885 Mbps to ~225 Mbps even after the mutex-contention fix. Fix: track inject_to_guest length around process_guest_frame's ethertype dispatch. If the call queued any frames, call epoll_waker.wake() — one byte to the non-blocking self-pipe, which unblocks net_poll_thread's epoll_wait so the queued frames flush within microseconds. Also fixes the related drain_to_guest event-source ordering bug: pending_events (filled by net_poll_thread) is now ALWAYS drained first, with the non-blocking epoll poll only running as a fallback when the queue is empty (test/bench paths without net_poll_thread). The previous code took the try_lock branch when net-poll was between iterations and silently dropped events the net-poll thread had already pushed. voidbox-network-bench post-fix: g2h: ~6000 Mbps (vs master 1885; +3.2x) bulk-g2h: ~3900 Mbps (vs master 1565; +2.5x at SO_RCVBUF=4096) rr p50: 2 us (parity with master) crr p50: ~50 ms (5x regression vs master ~10 ms — separate bug, tracked in follow-up; the 50 ms is exactly one epoll_wait cycle and points to a connection-establishment latency issue independent of the throughput path)

CRR p50 was regressing +40 ms (10 ms → 51 ms) post-Phase-6.4. The +40 ms exactly matches Linux's TCP delayed-ACK timer, and the cause is that Phase 6.4 widened the net-poll IRQ re-pulse cadence from 5 ms to 50 ms. The Linux guest spends most idle time in HLT and relies on regular vCPU scheduling slots — driven by our IRQ pulses — to advance its TCP delayed-ACK timer. At 50 ms cadence the guest's pure ACKs ride the next event-triggered IRQ, which can be 40+ ms away. At 5 ms the housekeeping cadence mirrors pre-6.4 and the timer fires on schedule. We lose Phase 6.4's headline "10x idle-wakeup reduction" goal but fast-path events still wake immediately via epoll readiness — so the net win vs master is unchanged: g2h throughput +250%, bulk throughput +250%, RR parity, CRR parity. voidbox-network-bench post-fix: g2h: ~6500 Mbps (vs master 1885; +247%) bulk-g2h: ~5400 Mbps (vs master 1565; +245%) rr p50: ~3 us (parity) crr p50: ~10100 us (parity — back to baseline 10 ms)

Per AGENTS.md doc-comment style ("avoid ticket IDs and PR/commit references inside doc comments and inline comments — they belong in commit messages and PR descriptions where they're audit trail; in code they age into noise as the ticketing context evolves"). Phase references fall into the same category. Comments are rewritten in present tense to explain the structural reasoning without referencing when each piece landed. Identifiers like test names and BROKEN_ON_PURPOSE markers are unchanged. Plan/spec docs in docs/superpowers/plans/ are intentionally untouched — phase references there ARE the audit trail.

Recovers Phase 6.4's headline 10x idle-wakeup reduction without re-introducing the +40 ms CRR regression that forced the cadence back to a fixed 5 ms. The adaptive policy: - last cycle had any kernel event → next timeout 5 ms (active) - last cycle timed out (no events) → next timeout 50 ms (idle) A single quiet cycle drops us to idle; a single event puts us back in active in the next cycle. The subtlety that motivated the additional EpollDispatch change: when the vCPU thread calls epoll_waker.wake() during a 50 ms idle wait, the kernel's epoll_wait returns with the self-pipe event. wait_with_timeout filters that event out and drains the pipe — so `epoll_events.is_empty()` would have remained true, and the naive "is_empty ⇒ idle" predicate kept us at 50 ms forever, regressing CRR p50 back to ~50 ms. wait_with_timeout now returns the *raw* kernel count (including self-pipe wakes) so the adaptive policy treats wakes as activity. Filtered events still arrive in the out parameter unchanged; only the return value's meaning shifted from "observable count" to "raw count," which all existing callers ignore. voidbox-network-bench post-fix: g2h: ~6680 Mbps (vs 5 ms fixed: 6500; vs master: +254%) bulk-g2h: ~5550 Mbps (vs 5 ms fixed: 5400; vs master: +254%) rr p50: 1 us (in 99-sample iteration; parity) crr p50: ~10100 us (parity preserved — adaptive correctly holds 5 ms cadence during connection bursts because each connection's wake() keeps raw_kernel_events > 0) Idle CPU dropped: profile-pre showed net_poll_thread on-CPU 4.93 % of total at fixed 5 ms cadence (200 wakes/sec); adaptive should drop to ~10x lower during idle stretches between iterations.

Copilot

Pull request overview

This PR (Phase 6.4) rewrites the VM’s host→guest RX polling path from a timer-driven loop to an epoll-driven dispatch with an adaptive timeout, aiming to reduce unnecessary wakeups while improving throughput/latency under load. It introduces a shared EpollDispatch owned by SlirpBackend, wires readiness events through net_poll_thread, and updates tests/benches/docs to match the new design.

Changes:

Added a Linux EpollDispatch (epoll fd + self-pipe waker) and integrated it into SlirpBackend for flow registration and readiness-driven relay.
Reworked net_poll_thread to block on epoll with adaptive 5ms/50ms timeouts and forward ready events into the backend via a per-backend queue.
Updated baseline tests, microbenches, and plan documentation to reflect the new event-driven polling architecture and snapshot/restore epoll rebuild expectations.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
tests/network_baseline.rs	Updates drain helper, retransmit semantics, and adds an epoll rebuild smoke test (feature-gated).
src/vmm/mod.rs	Replaces fixed sleep polling with epoll-driven net poll thread and adaptive timeout logic.
src/network/slirp.rs	Adds epoll dispatcher/waker, registers host FDs on flow creation, and drives relay via readiness events + event queue.
src/network/nat.rs	Documentation-only wording updates.
src/network/mod.rs	Exposes epoll hooks (`epoll_arc`, `push_ready_events`) on `NetworkBackend` (Linux-only).
src/network/epoll_dispatch.rs	New epoll wrapper with register/unregister, wait_with_timeout, and self-pipe `Waker`.
src/devices/virtio_net.rs	Plumbs epoll access/events forwarding from device to backend (Linux-only).
src/daemon.rs	Removes phase wording from a comment.
src/bin/voidbox-startup-bench/main.rs	Removes phase wording from output strings.
src/bin/voidbox-network-bench/main.rs	Updates backpressure-related documentation strings; adds placeholder RX latency field.
docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.4.md	Adds the detailed Phase 6.4 implementation plan.
benches/network.rs	Updates comments and adds a bench for one-packet RX latency cost under epoll dispatch.

Comments suppressed due to low confidence (1)

src/network/slirp.rs:1907

relay_icmp_echo no longer performs a periodic idle-timeout sweep; it only checks ICMP_IDLE_TIMEOUT for flows that appear in the ready set. ICMP flows that go silent (no more readiness events) will never be reaped, leaving sockets registered in epoll indefinitely. Reintroduce a lightweight sweep over flow_table (similar to UDP’s stale scan) or track expirations separately so idle ICMP entries are eventually evicted even without further traffic.

    /// Drain replies from each active ICMP echo socket and emit echo-reply
    /// frames to the guest, driven by epoll readiness.
    ///
    /// Only flows whose token appears in `ready` with EPOLLIN set are visited.
    /// Entries idle longer than `ICMP_IDLE_TIMEOUT` are still evicted on any
    /// readiness event for that flow.
    fn relay_icmp_echo(&mut self, ready: &[EpollEvent]) {
        const ICMP_IDLE_TIMEOUT: Duration = Duration::from_secs(60);
        let now = Instant::now();

        let flow_keys: Vec<FlowKey> = ready
            .iter()
            .filter(|ev| ev.readable && ev.token & PROTO_TAG_MASK == PROTO_TAG_ICMP)
            .filter_map(|ev| {
                self.flow_table.keys().copied().find(|fk| {
                    if let FlowKey::IcmpEcho(icmp_key) = fk {
                        flow_token_for_icmp(icmp_key) == ev.token
                    } else {
                        false
                    }
                })
            })
            .collect();
        for flow_key in flow_keys {
            let FlowKey::IcmpEcho(key) = flow_key else {
                continue;
            };
            let frame = {
                let Some(FlowEntry::IcmpEcho(entry)) = self.flow_table.get_mut(&flow_key) else {
                    continue;
                };
                if now.duration_since(entry.last_activity) > ICMP_IDLE_TIMEOUT {
                    None // mark for removal below
                } else {

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        if let Some(host_fd) = new_host_fd {
+            let token = flow_token_for_udp(&key);
+            self.epoll
+                .lock()
+                .unwrap()
+                .register(host_fd, token, true, false)
+                .ok();
+            self.epoll_waker.wake();
+        }


+        if let Some(host_fd) = new_icmp_fd {
+            let token = flow_token_for_icmp(&key);
+            self.epoll
+                .lock()
+                .unwrap()
+                .register(host_fd, token, true, false)
+                .ok();
+            self.epoll_waker.wake();
+        }


+        // Idle-timeout sweep: scan flow_table once without collecting a
+        // separate key Vec. 300-second inactivity applies regardless of epoll
+        // readiness; this is O(n) in the number of TCP flows but has no
+        // heap allocation overhead.
+        const TCP_IDLE_TIMEOUT: Duration = Duration::from_secs(300);
+        for (flow_key, entry) in &self.flow_table {
+            if let FlowEntry::Tcp(tcp_entry) = entry {
+                if tcp_entry.last_activity.elapsed() > TCP_IDLE_TIMEOUT
+                    && !to_remove.contains(flow_key)
+                {
+                    to_remove.push(*flow_key);
+                }


flow_token_for_tcp/udp truncated dst_ip to 16 bits; flow_token_for_icmp omitted dst_ip entirely. Multiple flows could collide on the same token, mis-routing readiness events to the wrong FlowKey. Replace the lossy encoding with a monotonic AtomicU64 counter per backend. Tokens are still tagged in the high byte for protocol demux (PROTO_TAG_TCP/UDP/ICMP); the lower 56 bits are unique. A new token_to_key HashMap makes readiness → FlowKey lookup O(1) instead of the previous linear flow_table scan.

net_poll_thread held the Mutex<EpollDispatch> across the blocking epoll_wait call (up to 50 ms in idle cadence). vCPU register/ unregister paths in handle_tcp_frame (and friends) had to acquire the same mutex and would block behind the wait, stalling guest TCP SYN handling for up to 50 ms during connection setup. epoll_ctl and epoll_wait are kernel-thread-safe on the same epoll fd; the only state requiring synchronization was the self-pipe (now eagerly initialized in EpollDispatch::new) and the registered fd count (now AtomicUsize). EpollDispatch becomes Sync without an external Mutex — the type changes from Arc<Mutex<EpollDispatch>> to Arc<EpollDispatch>. register/unregister run lock-free against the wait thread; only the kernel's per-epoll-fd internal lock serializes, and that's a fast path.

to_remove.contains() inside the idle-timeout loop was O(n*k) under churn. Switch the membership check to a HashSet<FlowKey> and only materialize the Vec once at the end for the removal loop.

Apply project rust-style rules to the recently-landed Phase 6.4 code (token rewrite + lock-free EpollDispatch refactor): 1. RegisterMode enum replaces (readable: bool, writable: bool) on EpollDispatch::register. Closed-set policy at the call site (Read / Write / ReadWrite) over two opaque booleans. 2. matches!() removed at three sites — the project guide prefers full match (or boolean ==) for compiler diagnostics if the matched type changes. The unprivileged-ICMP errno check now uses == comparisons; FlowKey::Tcp counter uses a for loop. 3. Iterator chains in relay loops rewritten as for loops with mutable accumulators per the project rule. relay_tcp_nat_data, relay_udp_flows, relay_icmp_echo all touched. Logic unchanged; control flow now reads top-down without a chain of .filter().filter_map().collect(). 4. Local renamed `rc` → `epoll_ctl_result` at three EpollDispatch sites. Role-bearing names are required in non-tiny scopes. 5. Dropped redundant explanatory comments around the relay loops ("Data relay — only for flows with…", "Skip entries already queued for…", "Collect ready ICMP flow keys via…"). The code below them is self-describing. Kept structural "why" comments (the ICMP idle-sweep rationale, the per-flow socket Drop contract). No behavior change. cargo fmt, clippy -D warnings, network_baseline (18/18), lib network (23/23), and voidbox-network-bench wall-clock (g2h ~6580 Mbps, CRR ~32 µs) all green.

Two artifacts from the rebase onto the new #69 tip: 1. bind_port_forward_listeners was called twice in with_security after the rebase — the second call shadowed the first, which was harmless but emitted an unused-variable warning under -D warnings. Drop the duplicate. 2. The bench-helpers helper insert_synthetic_lastack_entry was missing flow_token in its TcpNatEntry initializer (the field was added by the Copilot fix #1 monotonic-token rewrite that landed on #69 after the original phase 6.1 commit chain was written). Allocate via next_flow_token(PROTO_TAG_TCP).

…ase6.4-epoll # Conflicts: # benches/network.rs # src/bin/voidbox-network-bench/main.rs # src/network/mod.rs # src/network/nat.rs # src/network/slirp.rs # tests/network_baseline.rs

…ing (#76) * docs: Phase 6.1 detailed TDD plan — TCP half-close 9 bite-sized tasks covering the half-close state machine: - Established → FinWait1 (guest closed first) → LastAck on host EOF - Established → CloseWait (host closed first) → LastAck on guest FIN - LastAck → Closed on guest's final ACK or LAST_ACK_TIMEOUT (60s) Three new pins: tcp_half_close_guest_writes_first (BROKEN_ON_PURPOSE flips at Task 3), tcp_half_close_host_writes_first (passes after Task 3), tcp_last_ack_timeout_reaps_stale_entry (bench-helpers gated, Task 7). Severity: HIGH — current code drops host's pending response on guest shutdown(SHUT_WR), causing silent data loss for HTTP, SMTP, and any protocol with orderly half-close. * test(network): pin tcp_half_close_guest_writes_first (BROKEN_ON_PURPOSE) Guest sends FIN after data; current code marks state=Closed immediately on guest FIN, so the host's response data is dropped. This pin locks the broken behavior so the Phase 6.1 fix is legible to reviewers. Flips to PASS when Tasks 2–3 implementation lands. * feat(slirp): FinWait1 → LastAck and Established → CloseWait on host EOF Tasks 2+3+4 implemented together: clippy's -D warnings requires our_fin_sent to be used before committing the field addition, so the relay loop changes from Task 3 and the ACK handler from Task 4 are bundled with Task 2's struct changes. Structural changes (TcpNatEntry): - Add last_state_change: Instant — tracks when state last changed, used for LAST_ACK_TIMEOUT reaping (Task 6). - Add our_fin_sent: bool — prevents re-sending FIN on repeated epoll readiness events for the same half-close transition. - Initialize both fields at all three TcpNatEntry literal sites. FIN handler (handle_tcp_frame): - Established → FinWait1: ACK guest FIN, shutdown(Write) host socket so host sees EOF; stay alive for host's pending response data. - CloseWait → LastAck: host already closed; guest just closed; ACK and wait for guest's final ACK of our FIN. ACK handler (handle_tcp_frame): - LastAck → Closed: guest's final ACK reaps the entry. - Extend ACK-driven consume to FinWait1 state so bytes_in_flight drains and recv_peek eventually sees Ok(0) (host EOF). Relay loop (relay_tcp_nat_data): - Extend data relay to FinWait1 in addition to Established. - Ok(0) arm now dispatches on state: Established → CloseWait, FinWait1 → LastAck (instead of immediate Closed). - FIN-emit logic uses our_fin_sent guard and needs_fin predicate (CloseWait | LastAck) to send FIN exactly once per transition. Test: update tcp_half_close_guest_writes_first to ACK received data segments so the ACK-driven consume path drains the kernel buffer and recv_peek sees EOF. Without ACKs the FinWait1 relay sees the same bytes on every peek and never reaches Ok(0). * test(network): pin tcp_half_close_host_writes_first Symmetric mirror of tcp_half_close_guest_writes_first: host writes GREETING, shuts down its write side (Established → CloseWait), and waits for guest to reply; guest sends REPLY data + FIN. Also fix two implementation gaps found while writing this pin: 1. Guest→host data forwarding in CloseWait: the payload forward guard was Established-only. CloseWait must also forward guest data (host closed its write side but can still read). Extend the predicate to include CloseWait. 2. shutdown(Write) on CloseWait → LastAck: when the guest sends FIN in CloseWait the host application is blocked on read_to_end. We must call shutdown(Write) on host_stream so the kernel delivers EOF to the host. Without this the host blocks forever. * feat(slirp): LAST_ACK_TIMEOUT reaping prevents LastAck entry leak Add LAST_ACK_TIMEOUT = 60 s (TCP MSL × 2) as a module-scope constant. Merge the LastAck-timeout check into the existing idle-timeout sweep in relay_tcp_nat_data, making a single O(n) pass handle both conditions: - Standard idle-timeout (300 s of inactivity, any state): unchanged. - LastAck-timeout (60 s since last_state_change, LastAck state): reaps entries where the guest's final ACK never arrived, e.g. after a guest crash. Logs at WARN so operators can connect the timeout to a prior half-close sequence without a separate debugging session. The combined sweep avoids a second loop and an extra to_remove Vec allocation. Using last_state_change (set on every state transition) rather than last_activity (set on data relay) gives an accurate 60 s window from the moment we sent our FIN. * test(network): pin tcp_last_ack_timeout_reaps_stale_entry (bench-helpers) Add a bench-helpers–gated pin that verifies the LAST_ACK_TIMEOUT reaper: 1. Insert a synthetic LastAck entry via the new bench-helpers helper insert_synthetic_lastack_entry (TcpNatState::LastAck, our_fin_sent=true). 2. Back-date last_state_change by 70 s (> LAST_ACK_TIMEOUT = 60 s) via set_synthetic_last_state_change. 3. One drain_to_guest cycle runs the timeout sweep. 4. Assert the entry is gone (tcp_flow_state returns None). Two new bench-helpers methods on SlirpBackend: - insert_synthetic_lastack_entry: seeds a LastAck flow without a full half-close exchange, for timeout reaping tests. - set_synthetic_last_state_change: back-dates last_state_change on an existing entry to simulate an expired LAST_ACK_TIMEOUT. Also widen TcpNatState and tcp_flow_state to pub so the bench-helpers– gated external test can reference them by path. Both were previously pub(crate); they are now observable as part of the bench-helpers surface. * refactor(slirp): drop allow(dead_code) on TcpNatState; remove unused FinWait2 All TcpNatState variants are now wired into the state machine: - SynReceived, SynSent: handshake paths (unchanged) - Established: normal data transfer (unchanged) - FinWait1: guest half-closed; relay continues forwarding host response - CloseWait: host half-closed; relay continues forwarding guest data - LastAck: both sides closed; waiting for guest's final ACK or timeout - Closed: entry pending removal Remove FinWait2: distinguishing it from FinWait1 would require observing per-segment ACKs from the kernel — the relay does not track those. The relay stays in FinWait1 until host EOF, then jumps to LastAck. If the distinction is needed later, the variant can be re-added. Update the TcpNatState doc comment to document the state machine diagram and the FinWait2 omission rationale, replacing the old per-variant stubs. * fix(slirp): drop duplicate bind + init flow_token in lastack helper Two artifacts from the rebase onto the new #69 tip: 1. bind_port_forward_listeners was called twice in with_security after the rebase — the second call shadowed the first, which was harmless but emitted an unused-variable warning under -D warnings. Drop the duplicate. 2. The bench-helpers helper insert_synthetic_lastack_entry was missing flow_token in its TcpNatEntry initializer (the field was added by the Copilot fix #1 monotonic-token rewrite that landed on #69 after the original phase 6.1 commit chain was written). Allocate via next_flow_token(PROTO_TAG_TCP). * fix(slirp): doc-link to bench-helpers method via Self:: A bare intra-doc-link (`[`set_synthetic_last_state_change`]`) doesn't resolve to a struct method even when both items live in the same `impl` block — rustdoc looks up bare names in the module scope, not the enclosing impl. CI's `-D warnings` then elevates the `broken_intra_doc_links` warning to an error and fails the Documentation job. Switch to `Self::set_synthetic_last_state_change` so rustdoc resolves through the impl.

dpsoft added 30 commits April 27, 2026 17:52

test(network): scaffold network_baseline pins with frame helpers

8d63aaa

test(network): address review — restore reserved constants, alias IpA…

bc9eefb

…ddress

test(network): pin TCP handshake SYN-ACK emission

21134d8

test(network): pin TCP guest↔host data round-trip

1226986

test(network): BROKEN_ON_PURPOSE pin — 256 KB to_host cliff

5838586

test(network): hoist inline use statements to module scope

6cc850c

test(network): pin TCP rate limit, concurrent cap, deny list

a5b9128

test(network): pin ARP reply behavior for gateway and subnet

cf59b33

test(network): fix build_udp_frame payload_len double-count

40c0f7e

test(network): BROKEN_ON_PURPOSE pin — UDP non-DNS dropped

279af3b

test(network): BROKEN_ON_PURPOSE pin — ICMP echo dropped

4d96ad7

bench(network): divan microbenches for SLIRP hot paths

41c8382

bench(network): parametric NAT-walk scaling at 1/100/1000 flows

499ee35

bench(network): DNS cache hit and miss paths

7cca766

ci(bench): include network microbenches in regression gate

7868bb2

bench(network): voidbox-network-bench binary scaffold

e1ed1e2

bench(network): TCP throughput via busybox nc

df898d6

Implement measure_tcp_throughput_g2h: binds a host-side TCP listener, boots a VM, execs dd|nc in the guest, drains to EOF on the host, and computes Mbps from bytes_received / elapsed. h2g left None with a TODO.

bench(network): UDP DNS qps and JSON report output

594190b

feat(network): introduce NetworkBackend trait

b7e426c

refactor(slirp): add drain_to_guest wrapper for trait fit

046d57d

feat(slirp): impl NetworkBackend for SlirpStack

66f007f

refactor(network): rename SlirpStack to SlirpBackend

bf3cd6a

Type rename only — the slirp.rs module file keeps its name. SlirpBackend reflects the user-mode-NAT role rather than the underlying smoltcp library, keeping naming symmetric with future TapBackend / VhostNetBackend siblings.

docs(plans): add Phase 1 plan (ICMP echo via SOCK_DGRAM IPPROTO_ICMP)

028707c

dpsoft added 8 commits May 2, 2026 17:07

dpsoft changed the title ~~Phase 6.4: epoll-driven RX dispatch (3.5× throughput vs main)~~ Phase 6.4: epoll-driven RX dispatch with adaptive timeout (3.5× throughput vs main) May 4, 2026

dpsoft requested a review from Copilot May 4, 2026 11:56

dpsoft marked this pull request as ready for review May 4, 2026 11:56

Copilot started reviewing on behalf of dpsoft May 4, 2026 11:56 View session

Copilot AI reviewed May 4, 2026

View reviewed changes

dpsoft mentioned this pull request May 4, 2026

perf(slirp): port-forward listener on epoll (50ms → 180µs accept latency) #72

Closed

3 tasks

dpsoft added 5 commits May 4, 2026 14:40

perf(slirp): O(1) dedup in idle-timeout sweep via HashSet

5560498

to_remove.contains() inside the idle-timeout loop was O(n*k) under churn. Switch the membership check to a HashSet<FlowKey> and only materialize the Vec once at the end for the removal loop.

docs(vmm): net_poll_thread doc-comment matches adaptive timeout

d43dbc4

dpsoft force-pushed the smoltcp-passt-port-phase6.4-epoll branch from dc1001b to dcbf18b Compare May 4, 2026 18:12

dpsoft mentioned this pull request May 4, 2026

Phase 6.1: TCP half-close — fix silent data loss on shutdown(SHUT_WR) #73

Closed

Base automatically changed from smoltcp-passt-port-phase0 to main May 6, 2026 00:00

Merge remote-tracking branch 'origin/main' into smoltcp-passt-port-ph…

13e70dd

…ase6.4-epoll # Conflicts: # benches/network.rs # src/bin/voidbox-network-bench/main.rs # src/network/mod.rs # src/network/nat.rs # src/network/slirp.rs # tests/network_baseline.rs

dpsoft merged commit 6a5d8b5 into main May 6, 2026
22 checks passed

dpsoft deleted the smoltcp-passt-port-phase6.4-epoll branch May 6, 2026 00:38

dpsoft mentioned this pull request May 6, 2026

Phase 6.1: TCP half-close (FinWait/CloseWait/LastAck) + LAST_ACK reaping #76

Merged

5 tasks

This was referenced May 6, 2026

perf(slirp): port-forward listener on epoll (50 ms → sub-ms accept latency) #77

Merged

Phase 6.2: async outbound connect — eliminate 3s vCPU stall #78

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 6.4: epoll-driven RX dispatch with adaptive timeout (3.5× throughput vs main)#69

Phase 6.4: epoll-driven RX dispatch with adaptive timeout (3.5× throughput vs main)#69
dpsoft merged 122 commits intomainfrom
smoltcp-passt-port-phase6.4-epoll

dpsoft commented May 3, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dpsoft commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this branch does

Headline perf vs `origin/main` (`voidbox-network-bench --iterations 3 --bulk-mb 10`)

Divan microbench wins (`scripts/bench-compare.sh --baseline 47868f0 --skip-vm`)

Adaptive timeout: how it preserves correctness

Inherent microbench overhead (test/bench fallback path only)

Iteration history

Profile evidence (post-fix, 60 s capture during voidbox-network-bench)

Validation

Open follow-ups (deferred to future PRs, NOT blocking)

Test plan for review

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dpsoft commented May 3, 2026 •

edited

Loading

Divan microbench wins (`scripts/bench-compare.sh --baseline `47868f0` --skip-vm`)